
Support CNI STATUS Verb #123

Merged: 1 commit merged into containerd:main on Dec 6, 2024
Conversation

@MikeZappa87 (Contributor) commented Nov 18, 2024:

We are implementing the CNI Status verb. The Status verb provides the container runtime with the ability to determine whether it should call CNI ADD.

Please merge #122 first

Reference:
ocicni (cri-o/ocicni#196)
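Below is a minimal sketch (not part of this PR) of how a runtime could gate CNI ADD on a status check using go-cni; the exact Status signature added here is assumed and may differ (for example, it may take a context):

```go
package main

import (
	"context"
	"log"

	gocni "github.com/containerd/go-cni"
)

func main() {
	ctx := context.Background()

	// Initialize go-cni against the usual conf/bin directories.
	c, err := gocni.New(
		gocni.WithPluginConfDir("/etc/cni/net.d"),
		gocni.WithPluginDir([]string{"/opt/cni/bin"}),
	)
	if err != nil {
		log.Fatalf("init go-cni: %v", err)
	}
	if err := c.Load(gocni.WithLoNetwork, gocni.WithDefaultConf); err != nil {
		log.Fatalf("load CNI config: %v", err)
	}

	// Only call CNI ADD (Setup) once the network configuration reports ready.
	// Status signature assumed; after this PR it may also drive the CNI STATUS verb.
	if err := c.Status(); err != nil {
		log.Printf("network not ready, skipping CNI ADD: %v", err)
		return
	}

	// CNI ADD for a sandbox; the id and netns path below are illustrative only.
	if _, err := c.Setup(ctx, "example-sandbox-id", "/var/run/netns/example"); err != nil {
		log.Fatalf("CNI ADD failed: %v", err)
	}
	log.Println("CNI ADD succeeded")
}
```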

@MikeZappa87 force-pushed the issue/supportstatus branch 2 times, most recently from b47ffec to 8f10978 on November 18, 2024 19:32
@MikeZappa87 requested a review from mikebrow on November 22, 2024 17:27
@MikeZappa87 marked this pull request as ready for review on November 23, 2024 02:39
@MikeZappa87 marked this pull request as draft on November 24, 2024 01:21
@MikeZappa87 marked this pull request as ready for review on November 24, 2024 02:34
@mikebrow (Member) commented:

> We are implementing the CNI Status verb. The Status verb provides the container runtime with the ability to determine whether it should call CNI ADD.
>
> Please merge #122 first
>
> Reference: ocicni (cri-o/ocicni#196)

#122 merged, rebase please.

@MikeZappa87 (Contributor, Author) commented:

> #122 merged, rebase please.

Will do once I get home

cni.go: four review threads resolved (one outdated)
@mikebrow (Member) left a review comment:


See comment..

cni.go: review thread resolved
@MikeZappa87 force-pushed the issue/supportstatus branch 2 times, most recently from 5762dbb to 37d969a on November 26, 2024 01:27
Signed-off-by: Michael Zappa <[email protected]>
@MikeZappa87 (Contributor, Author) commented:

@squeed over here

@mikebrow (Member) left a review comment:


LGTM

@mikebrow merged commit 642f1ce into containerd:main on Dec 6, 2024
5 checks passed
@architkulkarni mentioned this pull request on Dec 9, 2024
@dmcgowan added the impact/changelog and area/cri (Container Runtime Interface) labels on Dec 13, 2024
Mengkzhaoyun pushed a commit to open-beagle/containerd that referenced this pull request Dec 19, 2024
containerd 2.0.1

Welcome to the v2.0.1 release of containerd!

The first patch release for containerd 2.0 includes a number of bug fixes and improvements.

Notable Updates:

* Fix apply IoOwner options when not in user namespace ([#11151](containerd/containerd#11151))
* Fix cri grpc plugin config migration ([#11140](containerd/containerd#11140))
* Support CNI STATUS Verb ([containerd/go-cni#123](containerd/go-cni#123))
* Update differ to handle zstd media types ([#11068](containerd/containerd#11068))
* Update runc binary to v1.2.3 ([#11142](containerd/containerd#11142))
* Fix panic due to nil dereference cgroups v2 ([#11098](containerd/containerd#11098))

Please try out the release binaries and report any issues at
https://github.com/containerd/containerd/issues.

Contributors:

* Derek McGowan
* Wei Fu
* Archit Kulkarni
* Jin Dong
* Phil Estes
* Akhil Mohan
* Akihiro Suda
* Alexey Lunev
* Austin Vazquez
* Maksym Pavlenko
* Mike Brown
* Michael Zappa
* Samuel Karp
* Sebastiaan van Stijn
* Andrey Smirnov
* Davanum Srinivas
<details><summary>50 commits</summary>
<p>

* Prepare release notes for v2.0.1 ([#11158](containerd/containerd#11158))
  * [`b0ece5dc5`](containerd/containerd@b0ece5d) Prepare release notes for v2.0.1
* build(deps): bump actions/attest-build-provenance from 1.4.4 to 2.1.0 ([#11154](containerd/containerd#11154))
  * [`fe6957084`](containerd/containerd@fe69570) build(deps): bump actions/attest-build-provenance from 1.4.4 to 2.1.0
* update xx to v1.6.1 for compatibility with alpine 3.21 and file 5.46+ ([#11153](containerd/containerd#11153))
  * [`eb2ce6882`](containerd/containerd@eb2ce68) update xx to v1.6.1 for compatibility with alpine 3.21 and file 5.46+
* ctr pull should unpack for default platform when transfer service is used ([#11139](containerd/containerd#11139))
  * [`44cdca68b`](containerd/containerd@44cdca6) ctr pull unpack for default platform using transfer service
* Fix apply IoOwner options when not in user namespace ([#11151](containerd/containerd#11151))
  * [`018d83650`](containerd/containerd@018d836) internal/cri: should not apply IoOwner options
* Update go-cni for CNI STATUS ([#11146](containerd/containerd#11146))
  * [`5eb7995a9`](containerd/containerd@5eb7995) feat: update go-cni version for CNI STATUS
* Fix cri grpc plugin config migration ([#11140](containerd/containerd#11140))
  * [`a2302ea89`](containerd/containerd@a2302ea) Add integration test for custom configuration
  * [`be5eda069`](containerd/containerd@be5eda0) complete cri grpc config migration
* Update runc binary to v1.2.3 ([#11142](containerd/containerd#11142))
  * [`a53eff53d`](containerd/containerd@a53eff5) update runc binary to v1.2.3
* Update differ to handle zstd media types ([#11068](containerd/containerd#11068))
  * [`73f57acb0`](containerd/containerd@73f57ac) Update differ to handle zstd media types
* update to go1.23.4 / go1.22.10 ([#11109](containerd/containerd#11109))
  * [`290e8bc70`](containerd/containerd@290e8bc) update to go1.23.4 / go1.22.10
* CI: update Fedora to 41 ([#11110](containerd/containerd#11110))
  * [`62b790bfa`](containerd/containerd@62b790b) CI: update Fedora to 41
* Fix panic due to nil dereference cgroups v2 ([#11098](containerd/containerd#11098))
  * [`3ba2df924`](containerd/containerd@3ba2df9) fix panic due to nil dereference cgroups v2
* Publish attestation as release artifact ([#11067](containerd/containerd#11067))
  * [`34a45cab2`](containerd/containerd@34a45ca) Publish attestation as release artifact
* Move rockylinux 9.4 to almalinux/9 in CI ([#11053](containerd/containerd#11053))
  * [`7dec6b460`](containerd/containerd@7dec6b4) move rocky 9.4 to almalinux/9 in CI
* *: should align pipe's owner with init process ([#11035](containerd/containerd#11035))
  * [`cf07f28ee`](containerd/containerd@cf07f28) *: should align pipe's owner with init process
* fix: set the credentials even if not provided ([#11031](containerd/containerd#11031))
  * [`986088866`](containerd/containerd@9860888) fix: set the credentials even if not provided
* fsverity_test.go: fix nil pointer derefence, fix test fail, fix minor/major device numbers resolving ([#10978](containerd/containerd#10978))
  * [`30b929ece`](containerd/containerd@30b929e) fsverity_test.go: fix major/minor device number resolving
  * [`10996a334`](containerd/containerd@10996a3) fsverity_test.go: fix nil pointer dereference, fix test fail
* update runc binary to 1.2.2 ([#11023](containerd/containerd#11023))
  * [`9081e979f`](containerd/containerd@9081e97) update runc binary to 1.2.2
* Revert "Disable vagrant strict dependency checking" ([#11009](containerd/containerd#11009))
  * [`6399c936f`](containerd/containerd@6399c93) Revert "Disable vagrant strict dependency checking"
* fsverity_linux.go: Fix fsverity.IsEnabled() for big endian systems ([#11005](containerd/containerd#11005))
  * [`a7f2b562f`](containerd/containerd@a7f2b56) fsverity_linux.go: Fix fsverity.IsEnabled() for big endian systems
* bump github.com/containerd/typeurl/v2 from 2.2.2 to 2.2.3 ([#10997](containerd/containerd#10997))
  * [`389e781ea`](containerd/containerd@389e781) build(deps): bump github.com/containerd/typeurl/v2 from 2.2.2 to 2.2.3
* update to go1.23.3 / go1.22.9 ([#10973](containerd/containerd#10973))
  * [`5b879f30c`](containerd/containerd@5b879f3) update to go1.23.3 / go1.22.9
* ci: enable marking 2.0 releases as latest ([#10963](containerd/containerd#10963))
  * [`458215f6c`](containerd/containerd@458215f) ci: enable marking 2.0 releases as latest
* Avoid arch info in the sed/replace when building cri-cni-containerd.tar.gz ([#10968](containerd/containerd#10968))
  * [`e99c2b55c`](containerd/containerd@e99c2b5) Avoid arch info in the sed/replace when building cri-cni-containerd.tar.gz
</p>
</details>
<details><summary>7 commits</summary>
<p>

* Support CNI STATUS Verb ([containerd/go-cni#123](containerd/go-cni#123))
  * [`208eca9`](containerd/go-cni@208eca9) support CNI status verb
* Bump github actions dependencies to match containerd CI repo and fix lint ([containerd/go-cni#122](containerd/go-cni#122))
  * [`386f475`](containerd/go-cni@386f475) Fix ci.yml indent
  * [`a9b0675`](containerd/go-cni@a9b0675) Another doc commit to trigger lint?
  * [`14af454`](containerd/go-cni@14af454) Bump github actions dependency versions
  * [`9e0d096`](containerd/go-cni@9e0d096) Trivial doc commit to trigger lint
</p>
</details>

* **github.com/containerd/go-cni**      v1.1.10 -> v1.1.11
* **github.com/containerd/typeurl/v2**  v2.2.2 -> v2.2.3

Previous release can be found at [v2.0.0](https://github.com/containerd/containerd/releases/tag/v2.0.0)
* `containerd-<VERSION>-<OS>-<ARCH>.tar.gz`:         ✅Recommended. Dynamically linked with glibc 2.31 (Ubuntu 20.04).
* `containerd-static-<VERSION>-<OS>-<ARCH>.tar.gz`:  Statically linked. Expected to be used on non-glibc Linux distributions. Not position-independent.

In addition to containerd, typically you will have to install [runc](https://github.com/opencontainers/runc/releases)
and [CNI plugins](https://github.com/containernetworking/plugins/releases) from their official sites too.

See also the [Getting Started](https://github.com/containerd/containerd/blob/main/docs/getting-started.md) documentation.
@buroa commented Dec 19, 2024:

Hey @mikebrow (cc @MikeZappa87), this seems to cause issues with dual-CNI setups. I am using both Cilium and Multus, and during a node reboot the node reports Ready and then shortly after goes NotReady. I was not seeing this issue with containerd=2.0.0. It's also random, roughly 50/50.

Kubelet just spams these logs when it happens:

m0: {"ts":1734619253013.3718,"caller":"kubelet/kubelet.go:2412","msg":"Skipping pod synchronization","err":"container runtime is down","errCauses":[{"error":"container runtime is down"}]}
m0: {"ts":1734619256213.9617,"caller":"kubelet/kubelet.go:2412","msg":"Skipping pod synchronization","err":"container runtime is down","errCauses":[{"error":"container runtime is down"}]}
m0: {"ts":1734619259270.4944,"caller":"nodestatus/setters.go:602","msg":"Node became not ready","v":0,"node":{"name":"m0"},"condition":{"type":"Ready","status":"False","lastHeartbeatTime":"2024-12-19T14:40:59Z","lastTransitionTime":"2024-12-19T14:40:59Z","reason":"KubeletNotReady","message":"container runtime is down"}}

@buroa commented Dec 19, 2024:

> @buroa what is your CNI configuration? I didn't actually know Cilium or Multus supported STATUS yet.

Thanks for the reply @MikeZappa87. Here are the files inside /etc/cni/net.d:

/etc/cni/net.d/00-multus.conf

{"binDir":"/opt/cni/bin","cniVersion":"0.3.1","logLevel":"error","name":"multus-cni-network","clusterNetwork":"/etc/cni/net.d/05-cilium.conflist","type":"multus-shim"}

/etc/cni/net.d/05-cilium.conflist

{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    {
       "type": "cilium-cni",
       "enable-debug": false,
       "log-file": "/var/run/cilium/cilium-cni.log"
    }
  ]
}

@MikeZappa87 (Contributor, Author) commented Dec 19, 2024:

> @buroa what is your CNI configuration? I didn't actually know Cilium or Multus supported STATUS yet.
>
> Thanks for the reply @MikeZappa87. Here are the files inside /etc/cni/net.d: (configs quoted above)

The CNI status would return nil for these, so I wonder if something else is going on. Are you able to get the containerd logs and perhaps the kubectl describe node output?

@buroa commented Dec 19, 2024:

@MikeZappa87 I have some log files, but not sure if they are helpful.

cri.log
containerd.log
kubelet.log

@MikeZappa87 (Contributor, Author) commented:

The log level on the runtime would need to be increased to determine what is actually failing.
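For reference, a sketch of the config change that raises the level, assuming containerd's standard [debug] section (the same section that appears in the Talos containerd.toml further down, where it is set to "info"); restart containerd after changing it:

```toml
[debug]
  level = "debug"   # the Talos config shown below uses "info"
  format = "json"
```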

@MikeZappa87 (Contributor, Author) commented:

@buroa can you give the output of kubectl describe node on one of the impacted nodes?

@buroa commented Dec 19, 2024:

@MikeZappa87 Doesn't show anything that isn't known already: container runtime is down

Name:               m0
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    extensions.talos.dev/gasket-driver=5815ee3-v1.9.0
                    extensions.talos.dev/i915=20241110-v1.9.0
                    extensions.talos.dev/intel-ucode=20241112
                    extensions.talos.dev/mei=v1.9.0
                    extensions.talos.dev/modules.dep=6.12.5-talos
                    extensions.talos.dev/thunderbolt=v1.9.0
                    feature.node.kubernetes.io/pci-0300_8086.present=true
                    feature.node.kubernetes.io/pci-0300_8086.sriov.capable=true
                    feature.node.kubernetes.io/system-os_release.ID=talos
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=v1.9.0
                    google.feature.node.kubernetes.io/coral=true
                    intel.feature.node.kubernetes.io/gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=m0
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    plan.upgrade.cattle.io/kubernetes=85eb67ef34eceaf0062c60a69ce05414ede42f6351d891d58f03dea4
                    plan.upgrade.cattle.io/talos=5e9a7ad8626d17ceefe4d4bfde01b3382385ef64b9f63dce0b7e69b3
                    topology.kubernetes.io/region=k8s
                    topology.kubernetes.io/zone=m
Annotations:        cluster.talos.dev/node-id: S098h66G6V6celySmLiPChKb7HSqn9wSoZ5zWpDN9tP
                    csi.volume.kubernetes.io/nodeid: {"rook-ceph.cephfs.csi.ceph.com":"m0","rook-ceph.rbd.csi.ceph.com":"m0"}
                    extensions.talos.dev/schematic: 0db147986e938b3f029f7f25aa850df4e6205873ad38ba6470d2ae0d9cff9ae7
                    networking.talos.dev/api-server-port: 6443
                    networking.talos.dev/self-ips: 192.168.10.10,192.168.10.200,fd64:a827:e1e0:fe49:5a47:caff:fe77:c58e
                    nfd.node.kubernetes.io/feature-labels:
                      google.feature.node.kubernetes.io/coral,intel.feature.node.kubernetes.io/gpu,pci-0300_8086.present,pci-0300_8086.sriov.capable,system-os_r...
                    node.alpha.kubernetes.io/ttl: 0
                    talos.dev/owned-annotations: ["extensions.talos.dev/schematic"]
                    talos.dev/owned-labels:
                      ["extensions.talos.dev/gasket-driver","extensions.talos.dev/i915","extensions.talos.dev/intel-ucode","extensions.talos.dev/mei","extension...
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 06 Dec 2024 16:31:48 -0600
Taints:             node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  m0
  AcquireTime:     <unset>
  RenewTime:       Thu, 19 Dec 2024 11:53:34 -0600
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 06 Dec 2024 16:32:35 -0600   Fri, 06 Dec 2024 16:32:35 -0600   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Thu, 19 Dec 2024 11:53:35 -0600   Thu, 19 Dec 2024 11:52:34 -0600   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 19 Dec 2024 11:53:35 -0600   Thu, 19 Dec 2024 11:52:34 -0600   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 19 Dec 2024 11:53:35 -0600   Thu, 19 Dec 2024 11:52:34 -0600   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                False   Thu, 19 Dec 2024 11:53:35 -0600   Thu, 19 Dec 2024 11:53:35 -0600   KubeletNotReady              container runtime is down
Addresses:
  InternalIP:  192.168.10.10
  Hostname:    m0
Capacity:
  cpu:                            20
  ephemeral-storage:              1873227100Ki
  gpu.intel.com/i915:             0
  gpu.intel.com/i915_monitoring:  0
  hugepages-1Gi:                  0
  hugepages-2Mi:                  2Gi
  memory:                         98584596Ki
  pods:                           150
  squat.ai/coral:                 0
Allocatable:
  cpu:                            19950m
  ephemeral-storage:              1726097657046
  gpu.intel.com/i915:             0
  gpu.intel.com/i915_monitoring:  0
  hugepages-1Gi:                  0
  hugepages-2Mi:                  2Gi
  memory:                         95860756Ki
  pods:                           150
  squat.ai/coral:                 0
System Info:
  Machine ID:                 7dbe9a1d2b67cb342587b3d490fdb654
  System UUID:                26a18500-1912-11ef-b778-63539f479f00
  Boot ID:                    a995e415-8eda-43a3-81ee-45db1bf01359
  Kernel Version:             6.12.5-talos
  OS Image:                   Talos (v1.9.0)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://2.0.1
  Kubelet Version:            v1.32.0
  Kube-Proxy Version:         v1.32.0
PodCIDR:                      10.244.1.0/24
PodCIDRs:                     10.244.1.0/24
Non-terminated Pods:          (24 in total)
  Namespace                   Name                                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                             ------------  ----------  ---------------  -------------  ---
  databases                   postgres-1                                       500m (2%)     0 (0%)      4Gi (4%)         4Gi (4%)       2m55s
  kube-system                 cilium-9kffv                                     100m (0%)     0 (0%)      10Mi (0%)        0 (0%)         2m35s
  kube-system                 cilium-operator-6fd5fcb79-sbn6x                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m38s
  kube-system                 generic-device-plugin-dt675                      10m (0%)      0 (0%)      64Mi (0%)        64Mi (0%)      40s
  kube-system                 intel-gpu-plugin-intel-gpu-plugin-msznr          40m (0%)      100m (0%)   45Mi (0%)        90Mi (0%)      39s
  kube-system                 irqbalance-mkf8w                                 25m (0%)      0 (0%)      128Mi (0%)       128Mi (0%)     39s
  kube-system                 kube-apiserver-m0                                200m (1%)     0 (0%)      512Mi (0%)       0 (0%)         48s
  kube-system                 kube-controller-manager-m0                       50m (0%)      0 (0%)      256Mi (0%)       0 (0%)         48s
  kube-system                 kube-scheduler-m0                                10m (0%)      0 (0%)      64Mi (0%)        0 (0%)         48s
  kube-system                 node-feature-discovery-worker-jm9s8              5m (0%)       0 (0%)      64Mi (0%)        512Mi (0%)     40s
  kube-system                 spegel-7mq8x                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m35s
  monitoring                  node-exporter-msgql                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m35s
  monitoring                  promtail-dsbtk                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         40s
  monitoring                  smartctl-exporter-0-xvjfl                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m35s
  networking                  multus-9ksbg                                     10m (0%)      0 (0%)      1Gi (1%)         1Gi (1%)       39s
  rook-ceph                   csi-cephfsplugin-provisioner-7ddf4b85cc-tfc59    700m (3%)     0 (0%)      1152Mi (1%)      2304Mi (2%)    176m
  rook-ceph                   csi-cephfsplugin-rrkx2                           350m (1%)     0 (0%)      768Mi (0%)       1536Mi (1%)    28h
  rook-ceph                   csi-rbdplugin-bqvpr                              350m (1%)     0 (0%)      768Mi (0%)       1536Mi (1%)    40s
  rook-ceph                   csi-rbdplugin-provisioner-7664f748b-4hmpk        450m (2%)     0 (0%)      1152Mi (1%)      2304Mi (2%)    176m
  rook-ceph                   rook-ceph-crashcollector-m0-65967d64d5-qwknp     100m (0%)     0 (0%)      60Mi (0%)        60Mi (0%)      48s
  rook-ceph                   rook-ceph-exporter-m0-84f99bb646-shmc2           50m (0%)      0 (0%)      50Mi (0%)        128Mi (0%)     48s
  rook-ceph                   rook-ceph-mon-e-756cdd5646-tj8fr                 1100m (5%)    0 (0%)      1124Mi (1%)      3Gi (3%)       2m35s
  rook-ceph                   rook-ceph-osd-1-74ff454fd-mb4d6                  1100m (5%)    0 (0%)      4196Mi (4%)      5Gi (5%)       2m34s
  rook-ceph                   rook-discover-db5cs                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         39s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests       Limits
  --------                       --------       ------
  cpu                            5150m (25%)    100m (0%)
  memory                         15533Mi (16%)  21974Mi (23%)
  ephemeral-storage              0 (0%)         0 (0%)
  hugepages-1Gi                  0 (0%)         0 (0%)
  hugepages-2Mi                  2Gi (100%)     2Gi (100%)
  gpu.intel.com/i915             0              0
  gpu.intel.com/i915_monitoring  0              0
  squat.ai/coral                 0              0
Events:
  Type     Reason                   Age                   From             Message
  ----     ------                   ----                  ----             -------
  Normal   NodeNotReady             3m20s (x2 over 3h5m)  kubelet          Node m0 status is now: NodeNotReady
  Normal   Shutdown                 3m20s                 kubelet          Shutdown manager detected shutdown event
  Normal   RegisteredNode           2m39s                 node-controller  Node m0 event: Registered Node m0 in Controller
  Normal   Starting                 68s                   kubelet          Starting kubelet.
  Warning  InvalidDiskCapacity      68s                   kubelet          invalid capacity 0 on image filesystem
  Normal   NodeAllocatableEnforced  68s                   kubelet          Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  62s (x8 over 68s)     kubelet          Node m0 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    62s (x8 over 68s)     kubelet          Node m0 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     62s (x7 over 68s)     kubelet          Node m0 status is now: NodeHasSufficientPID
  Warning  Rebooted                 62s                   kubelet          Node m0 has been rebooted, boot id: a995e415-8eda-43a3-81ee-45db1bf01359

@MikeZappa87 (Contributor, Author) commented:

I am looking for 'Network plugin returns error' since I don't think the issue you are seeing is related to the change.

"NetworkUnavailable False Fri, 06 Dec 2024 16:32:35 -0600 Fri, 06 Dec 2024 16:32:35 -0600 CiliumIsUp Cilium is running on this node"

@MikeZappa87 (Contributor, Author) commented:

If it was CNI Status we would see something like:

container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
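A simplified sketch, not containerd's actual code, of how a CRI runtime can map a failing CNI status check into that NetworkReady condition; the cri-api condition type is real, but the wiring shown here is assumed:

```go
package main

import (
	"fmt"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// statusFn stands in for the CNI plugin's status check (e.g. go-cni's Status).
type statusFn func() error

// networkCondition builds the condition kubelet surfaces as
// "container runtime network not ready: NetworkReady=false ...".
func networkCondition(check statusFn) *runtimeapi.RuntimeCondition {
	cond := &runtimeapi.RuntimeCondition{
		Type:   runtimeapi.NetworkReady,
		Status: true,
	}
	if err := check(); err != nil {
		cond.Status = false
		cond.Reason = "NetworkPluginNotReady"
		cond.Message = fmt.Sprintf("Network plugin returns error: %v", err)
	}
	return cond
}

func main() {
	failing := func() error { return fmt.Errorf("cni plugin not initialized") }
	fmt.Printf("%+v\n", networkCondition(failing))
}
```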

@buroa commented Dec 19, 2024:

@MikeZappa87 I set containerd to debug mode and captured some more detail. Here are the logs:

containerd.log
cri.log
kubelet.log

@MikeZappa87 (Contributor, Author) commented:

Is rook not in a healthy state? This might be the problem. I am not familiar with this but right now the errors aren't indicating anything related to the CNI. @mikebrow keep me honest

@buroa commented Dec 19, 2024:

@MikeZappa87 It's as healthy as it can be. Pods drain when I reboot the node and are then rescheduled. The Rook pods are quick to come back up, so they get somewhat spammed in the logs.

I have another buddy who runs the same stack and is having no problems, but they also do not run Multus. Once I remove Multus, my cluster is as healthy as can be. Again, I had no problems with containerd=2.0.0 until containerd=2.0.1.

@MikeZappa87 (Contributor, Author) commented:

To clarify: this works with containerd 2.0.1 and just Cilium, and it breaks when Multus is added?

@buroa commented Dec 19, 2024:

@MikeZappa87 Yep.

@MikeZappa87 (Contributor, Author) commented Dec 19, 2024:

I thought Multus rewrote the CNI conf file? The provided config doesn't show that.

In the Cilium ConfigMap, what is the value of cni-exclusive?

This might be more of a Cilium issue.

@mikebrow (Member) commented:

The config.toml is probably messed up for your 2.0.1 containerd; I'm not seeing the CRI plugin start. It should look more like this:

root@ubnt:~# containerd -l debug
INFO[2024-12-19T12:10:45.293432341-06:00] starting containerd                           revision=d9a58a892b77f292b842b849f059d8d8e8972b4a version=v2.0.0-rc.1-36-gd9a58a892
INFO[2024-12-19T12:10:45.303370039-06:00] loading plugin                                id=io.containerd.image-verifier.v1.bindir type=io.containerd.image-verifier.v1
INFO[2024-12-19T12:10:45.303407613-06:00] loading plugin                                id=io.containerd.internal.v1.opt type=io.containerd.internal.v1
INFO[2024-12-19T12:10:45.303426969-06:00] loading plugin                                id=io.containerd.warning.v1.deprecations type=io.containerd.warning.v1
INFO[2024-12-19T12:10:45.303432886-06:00] loading plugin                                id=io.containerd.event.v1.exchange type=io.containerd.event.v1
INFO[2024-12-19T12:10:45.303443454-06:00] loading plugin                                id=io.containerd.monitor.task.v1.cgroups type=io.containerd.monitor.task.v1
INFO[2024-12-19T12:10:45.303654744-06:00] loading plugin                                id=io.containerd.content.v1.content type=io.containerd.content.v1
INFO[2024-12-19T12:10:45.303690135-06:00] loading plugin                                id=io.containerd.snapshotter.v1.blockfile type=io.containerd.snapshotter.v1
INFO[2024-12-19T12:10:45.303716009-06:00] skip loading plugin                           error="no scratch file generator: skip plugin" id=io.containerd.snapshotter.v1.blockfile type=io.containerd.snapshotter.v1
INFO[2024-12-19T12:10:45.303887261-06:00] loading plugin                                id=io.containerd.snapshotter.v1.btrfs type=io.containerd.snapshotter.v1
INFO[2024-12-19T12:10:45.304105633-06:00] skip loading plugin                           error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs (ext4) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" id=io.containerd.snapshotter.v1.btrfs type=io.containerd.snapshotter.v1
INFO[2024-12-19T12:10:45.304120786-06:00] loading plugin                                id=io.containerd.snapshotter.v1.devmapper type=io.containerd.snapshotter.v1
INFO[2024-12-19T12:10:45.304146023-06:00] skip loading plugin                           error="devmapper not configured: skip plugin" id=io.containerd.snapshotter.v1.devmapper type=io.containerd.snapshotter.v1
INFO[2024-12-19T12:10:45.304152474-06:00] loading plugin                                id=io.containerd.snapshotter.v1.native type=io.containerd.snapshotter.v1
INFO[2024-12-19T12:10:45.304166490-06:00] loading plugin                                id=io.containerd.snapshotter.v1.overlayfs type=io.containerd.snapshotter.v1
INFO[2024-12-19T12:10:45.304212491-06:00] loading plugin                                id=io.containerd.metadata.v1.bolt type=io.containerd.metadata.v1
INFO[2024-12-19T12:10:45.304222501-06:00] metadata content store policy set             policy=shared
INFO[2024-12-19T12:10:45.313239232-06:00] loading plugin                                id=io.containerd.gc.v1.scheduler type=io.containerd.gc.v1
INFO[2024-12-19T12:10:45.313323425-06:00] loading plugin                                id=io.containerd.shim.v1.shim type=io.containerd.shim.v1
INFO[2024-12-19T12:10:45.313354616-06:00] loading plugin                                id=io.containerd.runtime.v2.task type=io.containerd.runtime.v2
INFO[2024-12-19T12:10:45.313397266-06:00] loading plugin                                id=io.containerd.differ.v1.walking type=io.containerd.differ.v1
INFO[2024-12-19T12:10:45.313414312-06:00] loading plugin                                id=io.containerd.lease.v1.manager type=io.containerd.lease.v1
INFO[2024-12-19T12:10:45.313432334-06:00] loading plugin                                id=io.containerd.sandbox.controller.v1.shim type=io.containerd.sandbox.controller.v1
INFO[2024-12-19T12:10:45.313901386-06:00] loading plugin                                id=io.containerd.service.v1.containers-service type=io.containerd.service.v1
INFO[2024-12-19T12:10:45.313922296-06:00] loading plugin                                id=io.containerd.service.v1.content-service type=io.containerd.service.v1
INFO[2024-12-19T12:10:45.313938676-06:00] loading plugin                                id=io.containerd.service.v1.diff-service type=io.containerd.service.v1
INFO[2024-12-19T12:10:45.313984695-06:00] loading plugin                                id=io.containerd.service.v1.images-service type=io.containerd.service.v1
INFO[2024-12-19T12:10:45.314005197-06:00] loading plugin                                id=io.containerd.service.v1.introspection-service type=io.containerd.service.v1
INFO[2024-12-19T12:10:45.314020957-06:00] loading plugin                                id=io.containerd.service.v1.namespaces-service type=io.containerd.service.v1
INFO[2024-12-19T12:10:45.314036685-06:00] loading plugin                                id=io.containerd.service.v1.snapshots-service type=io.containerd.service.v1
INFO[2024-12-19T12:10:45.314051508-06:00] loading plugin                                id=io.containerd.service.v1.tasks-service type=io.containerd.service.v1
DEBU[2024-12-19T12:10:45.314085800-06:00] No blockio config file specified, blockio not configured 
DEBU[2024-12-19T12:10:45.314095023-06:00] No RDT config file specified, RDT not configured 
INFO[2024-12-19T12:10:45.314106423-06:00] loading plugin                                id=io.containerd.grpc.v1.containers type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.314124183-06:00] loading plugin                                id=io.containerd.grpc.v1.content type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.314139577-06:00] loading plugin                                id=io.containerd.grpc.v1.diff type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.314155355-06:00] loading plugin                                id=io.containerd.grpc.v1.events type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.314169742-06:00] loading plugin                                id=io.containerd.grpc.v1.images type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.314191833-06:00] loading plugin                                id=io.containerd.grpc.v1.introspection type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.314206658-06:00] loading plugin                                id=io.containerd.grpc.v1.leases type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.314222130-06:00] loading plugin                                id=io.containerd.grpc.v1.namespaces type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.314237793-06:00] loading plugin                                id=io.containerd.sandbox.store.v1.local type=io.containerd.sandbox.store.v1
INFO[2024-12-19T12:10:45.314257274-06:00] loading plugin                                id=io.containerd.cri.v1.images type=io.containerd.cri.v1
WARN[2024-12-19T12:10:45.314360694-06:00] Ignoring unknown key in TOML for plugin       error="strict mode: fields in the document are missing in the target struct" key=PinnedImages plugin=io.containerd.cri.v1.images
INFO[2024-12-19T12:10:45.314584914-06:00] Get image filesystem path "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs" for snapshotter "overlayfs" 
INFO[2024-12-19T12:10:45.314599583-06:00] Start snapshots syncer                       
INFO[2024-12-19T12:10:45.314824847-06:00] loading plugin                                id=io.containerd.cri.v1.runtime type=io.containerd.cri.v1
INFO[2024-12-19T12:10:45.315374008-06:00] starting cri plugin                           config="{\"containerd\":{\"defaultRuntimeName\":\"runc\",\"runtimes\":{\"runc\":{\"runtimeType\":\"io.containerd.runc.v2\",\"runtimePath\":\"\",\"PodAnnotations\":[],\"ContainerAnnotations\":[],\"options\":{\"BinaryName\":\"\",\"CriuImagePath\":\"\",\"CriuWorkPath\":\"\",\"IoGid\":0,\"IoUid\":0,\"NoNewKeyring\":false,\"Root\":\"\",\"ShimCgroup\":\"\"},\"privileged_without_host_devices\":false,\"privileged_without_host_devices_all_devices_allowed\":false,\"baseRuntimeSpec\":\"\",\"cniConfDir\":\"\",\"cniMaxConfNum\":0,\"snapshotter\":\"\",\"sandboxer\":\"podsandbox\"}},\"ignoreBlockIONotEnabledErrors\":false,\"ignoreRdtNotEnabledErrors\":false},\"cni\":{\"binDir\":\"/opt/cni/bin\",\"confDir\":\"/etc/cni/net.d\",\"maxConfNum\":1,\"setupSerially\":false,\"confTemplate\":\"\",\"ipPref\":\"\"},\"enableSelinux\":false,\"selinuxCategoryRange\":1024,\"maxContainerLogSize\":16384,\"disableCgroup\":false,\"disableApparmor\":false,\"restrictOOMScoreAdj\":false,\"disableProcMount\":false,\"unsetSeccompProfile\":\"\",\"tolerateMissingHugetlbController\":true,\"disableHugetlbController\":true,\"device_ownership_from_security_context\":false,\"ignoreImageDefinedVolumes\":false,\"netnsMountsUnderStateDir\":false,\"enableUnprivilegedPorts\":true,\"enableUnprivilegedICMP\":true,\"enableCDI\":true,\"cdiSpecDirs\":[\"/etc/cdi\",\"/var/run/cdi\"],\"drainExecSyncIOTimeout\":\"0s\",\"ignoreDeprecationWarnings\":null,\"containerdRootDir\":\"/var/lib/containerd\",\"containerdEndpoint\":\"/run/containerd/containerd.sock\",\"rootDir\":\"/var/lib/containerd/io.containerd.grpc.v1.cri\",\"stateDir\":\"/run/containerd/io.containerd.grpc.v1.cri\"}"
INFO[2024-12-19T12:10:45.315445198-06:00] loading plugin                                id=io.containerd.sandbox.controller.v1.podsandbox type=io.containerd.sandbox.controller.v1
INFO[2024-12-19T12:10:45.315723885-06:00] loading plugin                                id=io.containerd.grpc.v1.sandbox-controllers type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.315748116-06:00] loading plugin                                id=io.containerd.grpc.v1.sandboxes type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.315766307-06:00] loading plugin                                id=io.containerd.grpc.v1.snapshots type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.315781243-06:00] loading plugin                                id=io.containerd.streaming.v1.manager type=io.containerd.streaming.v1
INFO[2024-12-19T12:10:45.315802289-06:00] loading plugin                                id=io.containerd.grpc.v1.streaming type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.315819263-06:00] loading plugin                                id=io.containerd.grpc.v1.tasks type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.315834860-06:00] loading plugin                                id=io.containerd.transfer.v1.local type=io.containerd.transfer.v1
INFO[2024-12-19T12:10:45.315884969-06:00] loading plugin                                id=io.containerd.grpc.v1.transfer type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.315901697-06:00] loading plugin                                id=io.containerd.grpc.v1.version type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.315917393-06:00] loading plugin                                id=io.containerd.monitor.container.v1.restart type=io.containerd.monitor.container.v1
INFO[2024-12-19T12:10:45.316288305-06:00] loading plugin                                id=io.containerd.tracing.processor.v1.otlp type=io.containerd.tracing.processor.v1
INFO[2024-12-19T12:10:45.316326547-06:00] skip loading plugin                           error="skip plugin: tracing endpoint not configured" id=io.containerd.tracing.processor.v1.otlp type=io.containerd.tracing.processor.v1
INFO[2024-12-19T12:10:45.316342044-06:00] loading plugin                                id=io.containerd.internal.v1.tracing type=io.containerd.internal.v1
INFO[2024-12-19T12:10:45.316395219-06:00] skip loading plugin                           error="skip plugin: tracing endpoint not configured" id=io.containerd.internal.v1.tracing type=io.containerd.internal.v1
INFO[2024-12-19T12:10:45.316409121-06:00] loading plugin                                id=io.containerd.grpc.v1.healthcheck type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.316429941-06:00] loading plugin                                id=io.containerd.nri.v1.nri type=io.containerd.nri.v1
INFO[2024-12-19T12:10:45.316637619-06:00] runtime interface created                    
INFO[2024-12-19T12:10:45.316646499-06:00] created NRI interface                        
INFO[2024-12-19T12:10:45.316802803-06:00] loading plugin                                id=io.containerd.grpc.v1.cri type=io.containerd.grpc.v1
INFO[2024-12-19T12:10:45.316861173-06:00] Connect containerd service                   
INFO[2024-12-19T12:10:45.316916685-06:00] using experimental NRI integration - disable nri plugin to prevent this 
DEBU[2024-12-19T12:10:45.361817960-06:00] runtime "runc" supports recursive read-only mounts 
DEBU[2024-12-19T12:10:45.361848612-06:00] runtime "runc" supports CRI userns: false    
INFO[2024-12-19T12:10:45.362067964-06:00] Start subscribing containerd event           
INFO[2024-12-19T12:10:45.362162236-06:00] Start recovering state                       
INFO[2024-12-19T12:10:45.362503534-06:00] serving...                                    address=/run/containerd/containerd.sock.ttrpc
INFO[2024-12-19T12:10:45.362573489-06:00] serving...                                    address=/run/containerd/containerd.sock
DEBU[2024-12-19T12:10:45.363628821-06:00] Loaded sandbox {Metadata:{ID:e6a7d015b8474c9b22b5950aec621ef248b21a1207cb84ceeb6bf466e9b525b7 Name:busybox-sandbox_default_hdishd83djaidwnduwk28bcsb_1 Config:&PodSandboxConfig{Metadata:&PodSandboxMetadata{Name:busybox-sandbox,Uid:hdishd83djaidwnduwk28bcsb,Namespace:default,Attempt:1,},Hostname:,LogDirectory:,DnsConfig:nil,PortMappings:[]*PortMapping{},Labels:map[string]string{},Annotations:map[string]string{},Linux:&LinuxPodSandboxConfig{CgroupParent:,SecurityContext:nil,Sysctls:map[string]string{},Overhead:nil,Resources:nil,},Windows:nil,} NetNSPath:/var/run/netns/cni-18c5d1a2-6d5a-29c8-b147-999c4433ef39 IP:10.88.0.32 AdditionalIPs:[2001:4860:4860::20] RuntimeHandler: CNIResult:0xc000696370 ProcessLabel:} Status:0xc0001ee2a0 Container:0xc00068c0e0 Sandboxer:podsandbox NetNS:0xc0002daf30 StopCh:0xc0001f1200 Stats:<nil>} 
DEBU[2024-12-19T12:10:45.373067168-06:00] Loaded container {Metadata:{ID:95828b1b1c6ceaca528c2008c35d1941e03335e0464e0cc61421cb917b7d0977 Name:busybox_busybox-sandbox_default_hdishd83djaidwnduwk28bcsb_0 SandboxID:e6a7d015b8474c9b22b5950aec621ef248b21a1207cb84ceeb6bf466e9b525b7 Config:&ContainerConfig{Metadata:&ContainerMetadata{Name:busybox,Attempt:0,},Image:&ImageSpec{Image:busybox:1.35.0,Annotations:map[string]string{},UserSpecifiedImage:busybox:1.35.0,RuntimeHandler:,},Command:[top],Args:[],WorkingDir:,Envs:[]*KeyValue{},Mounts:[]*Mount{},Devices:[]*Device{},Labels:map[string]string{},Annotations:map[string]string{},LogPath:,Stdin:false,StdinOnce:false,Tty:false,Linux:&LinuxContainerConfig{Resources:nil,SecurityContext:nil,},Windows:nil,CDIDevices:[]*CDIDevice{},} ImageRef:sha256:0c00acac9c2794adfa8bb7b13ef38504300b505a043bf68dff7a00068dcc732b LogPath: StopSignal: ProcessLabel:} Status:0xc00036c100 Container:0xc0007881c0 IO:<nil> StopCh:0xc0001f0a68 IsStopSignaledWithTimeout:0xc0001581a0 Stats:<nil>} 
DEBU[2024-12-19T12:10:45.379176699-06:00] Loaded image "sha256:6270bb605e12e581514ada5fd5b3216f727db55dc87d5889c790e4c760683fee" 
DEBU[2024-12-19T12:10:45.380860492-06:00] Loaded image "docker.io/library/busybox:1.35.0" 
DEBU[2024-12-19T12:10:45.381672580-06:00] Loaded image "registry.k8s.io/pause:3.9"     
DEBU[2024-12-19T12:10:45.382735678-06:00] Loaded image "registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db" 
DEBU[2024-12-19T12:10:45.383665155-06:00] Loaded image "sha256:e6f1816883972d4be47bd48879a08919b96afcd344132622e4d444987919323c" 
DEBU[2024-12-19T12:10:45.384496459-06:00] Loaded image "registry.k8s.io/pause@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097" 
DEBU[2024-12-19T12:10:45.385428744-06:00] Loaded image "registry.k8s.io/pause:3.6"     
DEBU[2024-12-19T12:10:45.386619629-06:00] Loaded image "registry.k8s.io/coredns/coredns:v1.11.1" 
DEBU[2024-12-19T12:10:45.387503236-06:00] Loaded image "sha256:cbb01a7bd410dc08ba382018ab909a674fb0e48687f0c00797ed5bc34fcc6bb4" 
DEBU[2024-12-19T12:10:45.389029206-06:00] Loaded image "docker.io/library/busybox@sha256:02289a9972c5024cd2f083221f6903786e7f4cb4a9a9696f665d20dd6892e5d6" 
DEBU[2024-12-19T12:10:45.389899900-06:00] Loaded image "registry.k8s.io/coredns/coredns@sha256:1eeb4c7316bacb1d4c8ead65571cd92dd21e27359f0d4917f1a5822a73b75db1" 
DEBU[2024-12-19T12:10:45.392043091-06:00] Loaded image "docker.io/library/nginx@sha256:ea97e6aace270d82c73da382ea1a8c42d44b9dc11b55159104e21c49c687e7fb" 
DEBU[2024-12-19T12:10:45.394190622-06:00] Loaded image "docker.io/library/nginx:latest" 
DEBU[2024-12-19T12:10:45.395752857-06:00] Loaded image "sha256:0c00acac9c2794adfa8bb7b13ef38504300b505a043bf68dff7a00068dcc732b" 
DEBU[2024-12-19T12:10:45.397705565-06:00] Loaded image "sha256:247f7abff9f7097bbdab57df76fedd124d1e24a6ec4944fb5ef0ad128997ce05" 
INFO[2024-12-19T12:10:45.397993799-06:00] Start event monitor                          
INFO[2024-12-19T12:10:45.398018360-06:00] Start cni network conf syncer for default    
INFO[2024-12-19T12:10:45.398029840-06:00] Start streaming server                       
INFO[2024-12-19T12:10:45.398048749-06:00] Registered namespace "k8s.io" with NRI       
INFO[2024-12-19T12:10:45.398063782-06:00] runtime interface starting up...             
INFO[2024-12-19T12:10:45.398073226-06:00] starting plugins...                          
DEBU[2024-12-19T12:10:45.398596992-06:00] sd notification                               notified=false state="READY=1"
INFO[2024-12-19T12:10:45.398624749-06:00] containerd successfully booted in 0.106084s  

@buroa commented Dec 19, 2024:

@mikebrow It is controlled via Talos.

/etc/cri/containerd.toml:

version = 3

disabled_plugins = [
    "io.containerd.nri.v1.nri",
    "io.containerd.internal.v1.tracing",
    "io.containerd.snapshotter.v1.blockfile",
    "io.containerd.tracing.processor.v1.otlp",
]

imports = [
    "/etc/cri/conf.d/cri.toml",
]

[debug]
level = "info"
format = "json"

/etc/cri/conf.d/cri.toml:

## /etc/cri/conf.d/00-base.part
## /etc/cri/conf.d/01-registries.part
## /etc/cri/conf.d/20-customization.part

version = 3

[plugins]
  [plugins.'io.containerd.cri.v1.images']
    discard_unpacked_layers = false

    [plugins.'io.containerd.cri.v1.images'.registry]
      config_path = '/etc/cri/conf.d/hosts'

      [plugins.'io.containerd.cri.v1.images'.registry.configs]

  [plugins.'io.containerd.cri.v1.runtime']
    [plugins.'io.containerd.cri.v1.runtime'.containerd]
      [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes]
        [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
          base_runtime_spec = '/etc/cri/conf.d/base-spec.json'

cri logs:

m0: {"level":"info","msg":"starting containerd","revision":"88aa2f531d6c2922003cc7929e51daf1c14caa0a","time":"2024-12-19T18:21:03.434500951Z","version":"v2.0.1"}
m0: {"id":"io.containerd.image-verifier.v1.bindir","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.440084583Z","type":"io.containerd.image-verifier.v1"}
m0: {"id":"io.containerd.internal.v1.opt","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.440124824Z","type":"io.containerd.internal.v1"}
m0: {"id":"io.containerd.warning.v1.deprecations","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.440382397Z","type":"io.containerd.warning.v1"}
m0: {"id":"io.containerd.content.v1.content","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.440405140Z","type":"io.containerd.content.v1"}
m0: {"id":"io.containerd.snapshotter.v1.native","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.440652921Z","type":"io.containerd.snapshotter.v1"}
m0: {"id":"io.containerd.snapshotter.v1.overlayfs","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.440679415Z","type":"io.containerd.snapshotter.v1"}
m0: {"id":"io.containerd.event.v1.exchange","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.440757543Z","type":"io.containerd.event.v1"}
m0: {"id":"io.containerd.monitor.task.v1.cgroups","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.440779112Z","type":"io.containerd.monitor.task.v1"}
m0: {"id":"io.containerd.metadata.v1.bolt","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.441000539Z","type":"io.containerd.metadata.v1"}
m0: {"level":"info","msg":"metadata content store policy set","policy":"shared","time":"2024-12-19T18:21:03.441017450Z"}
m0: {"id":"io.containerd.gc.v1.scheduler","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665691156Z","type":"io.containerd.gc.v1"}
m0: {"id":"io.containerd.differ.v1.walking","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665722325Z","type":"io.containerd.differ.v1"}
m0: {"id":"io.containerd.lease.v1.manager","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665729328Z","type":"io.containerd.lease.v1"}
m0: {"id":"io.containerd.service.v1.containers-service","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665833114Z","type":"io.containerd.service.v1"}
m0: {"id":"io.containerd.service.v1.content-service","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665840920Z","type":"io.containerd.service.v1"}
m0: {"id":"io.containerd.service.v1.diff-service","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665849280Z","type":"io.containerd.service.v1"}
m0: {"id":"io.containerd.service.v1.images-service","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665854402Z","type":"io.containerd.service.v1"}
m0: {"id":"io.containerd.service.v1.introspection-service","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665859332Z","type":"io.containerd.service.v1"}
m0: {"id":"io.containerd.service.v1.namespaces-service","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665869284Z","type":"io.containerd.service.v1"}
m0: {"id":"io.containerd.service.v1.snapshots-service","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665873486Z","type":"io.containerd.service.v1"}
m0: {"id":"io.containerd.shim.v1.manager","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665877484Z","type":"io.containerd.shim.v1"}
m0: {"id":"io.containerd.runtime.v2.task","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.665882228Z","type":"io.containerd.runtime.v2"}
m0: {"id":"io.containerd.service.v1.tasks-service","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.666918419Z","type":"io.containerd.service.v1"}
m0: {"id":"io.containerd.grpc.v1.containers","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.667792234Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.content","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.667860497Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.diff","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.667922364Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.events","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.667963744Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.images","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.668002875Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.introspection","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.668062520Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.leases","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.668102305Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.namespaces","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.668149195Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.sandbox.store.v1.local","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.668190211Z","type":"io.containerd.sandbox.store.v1"}
m0: {"id":"io.containerd.cri.v1.images","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.668235797Z","type":"io.containerd.cri.v1"}
m0: {"level":"info","msg":"Get image filesystem path \"/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs\" for snapshotter \"overlayfs\"","time":"2024-12-19T18:21:03.668593661Z"}
m0: {"level":"info","msg":"Start snapshots syncer","time":"2024-12-19T18:21:03.668633781Z"}
m0: {"id":"io.containerd.cri.v1.runtime","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.668705890Z","type":"io.containerd.cri.v1"}
m0: {"config":"{\"containerd\":{\"defaultRuntimeName\":\"runc\",\"runtimes\":{\"runc\":{\"runtimeType\":\"io.containerd.runc.v2\",\"runtimePath\":\"\",\"PodAnnotations\":null,\"ContainerAnnotations\":null,\"options\":{\"BinaryName\":\"\",\"CriuImagePath\":\"\",\"CriuWorkPath\":\"\",\"IoGid\":0,\"IoUid\":0,\"NoNewKeyring\":false,\"Root\":\"\",\"ShimCgroup\":\"\"},\"privileged_without_host_devices\":false,\"privileged_without_host_devices_all_devices_allowed\":false,\"baseRuntimeSpec\":\"/etc/cri/conf.d/base-spec.json\",\"cniConfDir\":\"\",\"cniMaxConfNum\":0,\"snapshotter\":\"\",\"sandboxer\":\"podsandbox\",\"io_type\":\"\"}},\"ignoreBlockIONotEnabledErrors\":false,\"ignoreRdtNotEnabledErrors\":false},\"cni\":{\"binDir\":\"/opt/cni/bin\",\"confDir\":\"/etc/cni/net.d\",\"maxConfNum\":1,\"setupSerially\":false,\"confTemplate\":\"\",\"ipPref\":\"\",\"useInternalLoopback\":false},\"enableSelinux\":false,\"selinuxCategoryRange\":1024,\"maxContainerLogSize\":16384,\"disableApparmor\":false,\"restrictOOMScoreAdj\":false,\"disableProcMount\":false,\"unsetSeccompProfile\":\"\",\"tolerateMissingHugetlbController\":true,\"disableHugetlbController\":true,\"device_ownership_from_security_context\":false,\"ignoreImageDefinedVolumes\":false,\"netnsMountsUnderStateDir\":false,\"enableUnprivilegedPorts\":true,\"enableUnprivilegedICMP\":true,\"enableCDI\":true,\"cdiSpecDirs\":[\"/etc/cdi\",\"/var/run/cdi\"],\"drainExecSyncIOTimeout\":\"0s\",\"ignoreDeprecationWarnings\":null,\"containerdRootDir\":\"/var/lib/containerd\",\"containerdEndpoint\":\"/run/containerd/containerd.sock\",\"rootDir\":\"/var/lib/containerd/io.containerd.grpc.v1.cri\",\"stateDir\":\"/run/containerd/io.containerd.grpc.v1.cri\"}","level":"info","msg":"starting cri plugin","time":"2024-12-19T18:21:03.669557614Z"}
m0: {"id":"io.containerd.podsandbox.controller.v1.podsandbox","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.672698307Z","type":"io.containerd.podsandbox.controller.v1"}
m0: {"id":"io.containerd.sandbox.controller.v1.shim","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674117698Z","type":"io.containerd.sandbox.controller.v1"}
m0: {"id":"io.containerd.grpc.v1.sandbox-controllers","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674194517Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.sandboxes","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674229389Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.snapshots","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674244996Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.streaming.v1.manager","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674258253Z","type":"io.containerd.streaming.v1"}
m0: {"id":"io.containerd.grpc.v1.streaming","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674274613Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.tasks","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674287879Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.transfer.v1.local","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674305579Z","type":"io.containerd.transfer.v1"}
m0: {"id":"io.containerd.grpc.v1.transfer","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674336693Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.version","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674353904Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.monitor.container.v1.restart","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674367906Z","type":"io.containerd.monitor.container.v1"}
m0: {"id":"io.containerd.ttrpc.v1.otelttrpc","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674980001Z","type":"io.containerd.ttrpc.v1"}
m0: {"id":"io.containerd.grpc.v1.healthcheck","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674986527Z","type":"io.containerd.grpc.v1"}
m0: {"id":"io.containerd.grpc.v1.cri","level":"info","msg":"loading plugin","time":"2024-12-19T18:21:03.674990866Z","type":"io.containerd.grpc.v1"}
m0: {"level":"info","msg":"Connect containerd service","time":"2024-12-19T18:21:03.674994447Z"}
m0: {"level":"info","msg":"NRI service not found, NRI support disabled","time":"2024-12-19T18:21:03.675009240Z"}
m0: {"level":"info","msg":"Start subscribing containerd event","time":"2024-12-19T18:21:03.692998002Z"}
m0: {"level":"info","msg":"Start recovering state","time":"2024-12-19T18:21:03.693225460Z"}
m0: {"address":"/run/containerd/containerd.sock.ttrpc","level":"info","msg":"serving...","time":"2024-12-19T18:21:03.693346175Z"}
m0: {"address":"/run/containerd/containerd.sock","level":"info","msg":"serving...","time":"2024-12-19T18:21:03.693470428Z"}
m0: {"error":"unable to find sandbox \"198dd57200b65202ebf4a0de83772411e2fff2584b19ab2e4f9c2812ebdd4835\": not found","level":"error","msg":"failed to recover sandbox state","sandbox":"198dd57200b65202ebf4a0de83772411e2fff2584b19ab2e4f9c2812ebdd4835","tim
e":"2024-12-19T18:21:03.722923325Z"}
m0: {"error":"unable to find sandbox \"3943ce14bb4bf468a78d88e0ec55146b340cb7fec81e06881fc97fa501f81f2b\": not found","level":"error","msg":"failed to recover sandbox state","sandbox":"3943ce14bb4bf468a78d88e0ec55146b340cb7fec81e06881fc97fa501f81f2b","tim
e":"2024-12-19T18:21:03.722980125Z"}
m0: {"error":"unable to find sandbox \"4bab9618cadef9b6bd4c951a161aa6216c2cb47857e92da555ec5e85490ddd75\": not found","level":"error","msg":"failed to recover sandbox state","sandbox":"4bab9618cadef9b6bd4c951a161aa6216c2cb47857e92da555ec5e85490ddd75","tim
e":"2024-12-19T18:21:03.723015409Z"}
m0: {"error":"unable to find sandbox \"9fe74455d7c20887915c57eba694d1b5f94058c47a399b8bb9fdfe8b9113b00d\": not found","level":"error","msg":"failed to recover sandbox state","sandbox":"9fe74455d7c20887915c57eba694d1b5f94058c47a399b8bb9fdfe8b9113b00d","time":"2024-12-19T18:21:03.723099299Z"}
m0: {"error":"unable to find sandbox \"ab2842b2eaa8b921cd3765b9e9742aae7e4cb28d6bf47d2243e7b1eadf9088ca\": not found","level":"error","msg":"failed to recover sandbox state","sandbox":"ab2842b2eaa8b921cd3765b9e9742aae7e4cb28d6bf47d2243e7b1eadf9088ca","time":"2024-12-19T18:21:03.723130600Z"}
m0: {"error":"unable to find sandbox \"ad6748a056c8e9039a5f99abd04ddc1c2eebaab687a4fde5b87d38f5382012ff\": not found","level":"error","msg":"failed to recover sandbox state","sandbox":"ad6748a056c8e9039a5f99abd04ddc1c2eebaab687a4fde5b87d38f5382012ff","time":"2024-12-19T18:21:03.723164479Z"}
m0: {"error":"unable to find sandbox \"cacad76b3e8ffa31b5eafe101a2da4a1459082a733d1ce539cf73510ff75b675\": not found","level":"error","msg":"failed to recover sandbox state","sandbox":"cacad76b3e8ffa31b5eafe101a2da4a1459082a733d1ce539cf73510ff75b675","time":"2024-12-19T18:21:03.723193693Z"}
m0: {"level":"info","msg":"Start event monitor","time":"2024-12-19T18:21:04.026470728Z"}
m0: {"level":"info","msg":"Start cni network conf syncer for default","time":"2024-12-19T18:21:04.026575399Z"}
m0: {"level":"info","msg":"Start streaming server","time":"2024-12-19T18:21:04.026596422Z"}
m0: {"level":"info","msg":"containerd successfully booted in 0.592766s","time":"2024-12-19T18:21:04.026640435Z"}
m0: {"address":"unix:///run/containerd/s/962cf41f2ac48833a2be13852f068a1337459c1e8cb491130aa46184ec2f08a1","level":"info","msg":"connecting to shim kubelet","namespace":"system","protocol":"ttrpc","time":"2024-12-19T18:21:14.050198476Z","version":3}
m0: {"address":"unix:///run/containerd/s/c5a0037f7a540abcaab874cf70b1962448b8b6f36e28be60570eb3465684e0e5","level":"info","msg":"connecting to shim etcd","namespace":"system","protocol":"ttrpc","time":"2024-12-19T18:21:14.050276262Z","versio

@mikebrow (Member) commented:

Well, now that you have containerd up, you should be able to get kubelet to reconnect.

@buroa commented Dec 19, 2024:

@mikebrow Once containerd gets into the bugged state, kubelet can never reconnect to it; it just spams container runtime is down. I have provided as much information as I can here; for now I will wait until others receive this update and start to see issues as well.

@mikebrow (Member) commented:

siderolabs/talos#9496 related perhaps

@buroa commented Dec 19, 2024:

Just to note: This is only happening when I reboot + have multus enabled. If I disable multus and reboot, the cluster comes up fine. Once the cluster settles... I can enable multus and nothing breaks. So it's a race somewhere on boot with cilium/multus/containerd.

@mikebrow (Member) commented Dec 19, 2024:

It is possible you have multiple problems, as the sandbox store is also reporting errors in your cri log. Note: containerd/containerd#10848 (comment)

@buroa commented Dec 19, 2024:

Those happen on every Kubernetes cluster I have ever touched; they just look scary. Any time you reboot a node, it takes a minute or two for things to settle.

@mikebrow (Member) commented:

We have a config field in the CNI setup, setup_serially; set that to true.

@mikebrow (Member) commented:

    [plugins.'io.containerd.cri.v1.runtime'.containerd]
      default_runtime_name = 'runc'
      ignore_blockio_not_enabled_errors = false
      ignore_rdt_not_enabled_errors = false

      [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes]
        [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
          runtime_type = 'io.containerd.runc.v2'
          runtime_path = ''
          pod_annotations = []
          container_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          base_runtime_spec = ''
          cni_conf_dir = ''
          cni_max_conf_num = 0
          snapshotter = ''
          sandboxer = 'podsandbox'

          [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
            BinaryName = ''
            CriuImagePath = ''
            CriuWorkPath = ''
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            Root = ''
            ShimCgroup = ''

    [plugins.'io.containerd.cri.v1.runtime'.cni]
      bin_dir = '/opt/cni/bin'
      conf_dir = '/etc/cni/net.d'
      max_conf_num = 1
      setup_serially = false
      conf_template = ''
      ip_pref = ''

^ this guy
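A sketch of that section with the flag flipped (values otherwise as shown above; where this lands depends on how your config is assembled, e.g. one of the imported Talos .part files):

```toml
[plugins.'io.containerd.cri.v1.runtime'.cni]
  bin_dir = '/opt/cni/bin'
  conf_dir = '/etc/cni/net.d'
  max_conf_num = 1
  setup_serially = true   # default is false, as shown above
```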

@buroa commented Dec 19, 2024:

@mikebrow On it, moment.

Update: still broke :(

@mikebrow (Member) commented:

Anyhow, that would rule out a timing conflict from setting them up in parallel.

@MikeZappa87 (Contributor, Author) commented:

If adding Multus is causing the issue, I would try to confirm that it's a Multus issue. One way to do this is to run CRI-O, Cilium, and Multus; if that doesn't work, try CRI-O and Cilium alone. If that works, it's a Multus issue, and in that case I would reach out to them as well.
